[QNN][TFLite] TFLite rounding mode support #4828
Conversation
Thanks @LiangHao151941 ! I haven't reviewed the code, however, I have some high level comments:
Makes sense! I'll follow up on those. @FrozenGene
Thanks for this PR. It will help reduce/remove those 1-off deviations. One request that I have: while setting the rounding mode to TFLite in the TFLite parser, it might be better to set it by adding an extra arg in the parser entry point. @u99127 @mbarrett97 will also be interested in this. Also, I think we can have small PRs merged and incrementally improve test coverage.
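For illustration, a minimal sketch of what exposing such a knob from the frontend could look like; the `rounding` keyword below is hypothetical (it is not part of the existing `relay.frontend.from_tflite` signature) and is shown commented out:

```python
# Hypothetical sketch only: exposing the requantize rounding mode as a
# TFLite-frontend argument. The `rounding` kwarg does not exist in the
# current relay.frontend.from_tflite signature.
from tvm import relay
import tflite.Model

def load_tflite_model(tflite_model_buf, shape_dict, dtype_dict):
    model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)
    mod, params = relay.frontend.from_tflite(
        model,
        shape_dict=shape_dict,
        dtype_dict=dtype_dict,
        # rounding="TFLITE",  # proposed extra arg, forwarded to qnn.requantize
    )
    return mod, params
```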
LGTM. I like the scope of this PR. IMO, we can get this in. And send other PRs for TFLite parser.
Yes, this is my thought too. I am ok with it.
IMO, I would prefer adding the TFLite parser changes in this PR too, because that keeps everything organized into one complete solution for TFLite rounding. When others review the history, they won't need to piece together isolated PRs, possibly spread over a long time, to get the complete picture of TFLite rounding support. It is true that we could make small PRs and merge them quickly, so that we get the functionality sooner and can locate problems quickly. However, from this PR's perspective, the TFLite parser code will not be complex and can be contained here, so one single PR is acceptable in my opinion. So I would suggest @LiangHao151941 add the TFLite parser in this PR too. If we think it is important, we could even port this back to release 0.6, and one PR makes that easy.
As you have done the parser part, you could also change the test_forward for tflite. We shouldn't need atol=1 any more, ideally.
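For reference, a minimal sketch of what the tightened comparison in test_forward could look like, assuming the quantized outputs are compared with `tvm.testing.assert_allclose` as elsewhere in the test suite:

```python
import numpy as np
import tvm.testing

def compare_quantized_outputs(tvm_output, tflite_output):
    # With TFLite rounding supported end to end, quantized outputs should
    # match bit-exactly, so the previous atol=1 slack can be dropped.
    tvm.testing.assert_allclose(
        np.asarray(tvm_output), np.asarray(tflite_output), rtol=0, atol=0
    )
```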
Some updates on further experiments @FrozenGene @anijain2305
The TFLite rounding mode is added in the tflite relay frontend.
For the current label comparison, the new rounding mode works fine. Unfortunately, it does not bring bit-exact execution: for the 3 qnn models in test_forward, bit-level mismatches remain.
It will be essential to further investigate the root causes of the bit mismatch. Will follow up.
I thought a little more about the bit-exact problem. One source of discrepancy for certain is the QNN add and QNN concatenate ops. These call Requantize internally, and they will use the default rounding in C++ (UPWARD). For testing, @LiangHao151941, you can set the C++ default to TFLITE to see if it helps. Meanwhile, we can also think about how we can make that C++ rounding visible at the Python level.
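For context, a minimal sketch of how the rounding mode is selected for an explicit requantize on the Python side (the exact form of the scale and zero-point arguments depends on the TVM version); the point above is that the requantize created internally by qnn.add / qnn.concatenate never goes through this argument:

```python
from tvm import relay

x = relay.var("x", shape=(1, 64), dtype="int32")
# Explicit requantize: the rounding mode is visible and settable from Python.
y = relay.qnn.op.requantize(
    x,
    input_scale=relay.const(0.25, "float32"),
    input_zero_point=relay.const(0, "int32"),
    output_scale=relay.const(0.5, "float32"),
    output_zero_point=relay.const(0, "int32"),
    rounding="TFLITE",  # "UPWARD" is the C++-side default
    out_dtype="int8",
)
# The requantize that qnn.add / qnn.concatenate construct in C++ takes no such
# argument, so it silently stays on the default (UPWARD) rounding.
```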
That is ok. Let's try to see if we can get it in a single PR.
Maybe we could use ... Another way is we provide ...
I like this way
cc @LiangHao151941 Could you try this way to do bit-exact testing? For convenience, you could simply change the C++ default rounding value to "TFLITE" to test.
Just modified the add/mul/concat requantize rounding mode and tested; no luck. Will change the default rounding behavior for a later test. Update: I forced the FixedPointMultiply(PerChannel) rounding mode to be TFLITE, but am still unable to get bit-exact results. One more thing: setting the default rounding mode to TFLITE seems to break the GPU test of mobilenet_v2; maybe you have ideas/suggestions?
:( This will be a challenging task. I think, instead of trying to fix the full e2e test, we should now focus on unit tests. I can help with improving unit test coverage.
Do we run TFLite tests on GPU? A little confused.
I don't yet have an opinion on these changes but there are coding style related changes that I've noticed which could be explained / improved.
In general, I would prefer someone read through the comments to make sure they still hold once the changes land, as I don't see them fully updated in the source base.
Let us break the model down layer by layer and compare with TFLite. I want to describe my development approach; maybe it could help you. For example, for the mobilenetv2 quantized model, you could get the quantized TensorFlow and TFLite models, then compare the intermediate outputs layer by layer against the TFLite interpreter. When you verify, I think you could run on CPU locally first to find the issue, then consider the GPU CI issue.
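A sketch of the kind of layer-by-layer check described above, using the TFLite Python interpreter to read an intermediate tensor; the model path and tensor index are placeholders, and depending on the TFLite version intermediate tensors may be reused, so this is a debugging aid rather than a guaranteed API:

```python
import numpy as np
import tensorflow as tf

def tflite_intermediate(model_path, input_data, tensor_index):
    # Run the quantized .tflite model and read one intermediate tensor so it
    # can be compared against the output of the corresponding TVM sub-graph.
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_detail = interpreter.get_input_details()[0]
    interpreter.set_tensor(input_detail["index"], input_data)
    interpreter.invoke()
    return interpreter.get_tensor(tensor_index)

# Example (placeholder values): compare one layer and expect a bit-exact match.
# ref = tflite_intermediate("mobilenet_v2_quant.tflite", data, tensor_index=42)
# np.testing.assert_array_equal(tvm_layer_output, ref)
```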
That would be appreciated, thanks!
Yes, according to the Jenkins report.
Thanks for the detailed instructions here. Will follow up.
No problem.
ping @LiangHao151941
ping @LiangHao151941 please follow up
There is quite some tedious work to do, but thanks for the reminder; will follow up shortly @FrozenGene @tqchen
Force-pushed from 67c2be9 to 3fef741
The problem is located and solved. The discrepancy comes from the average pooling calculation when the input is an integral type: TFLite does UPWARD rounding after the division in avg_pool, but the current topi implementation of avg_pool does FLOOR after the division. The fix is added and tested locally on CPU. Now the qnn tests in the tflite test_forward have rtol and atol set to 0, which should meet the bit-exact execution requirement. @FrozenGene @anijain2305 @tqchen @u99127 Please have another look, thanks!
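To illustrate the discrepancy with hypothetical numbers (the actual fix lives in the topi avg_pool implementation): TFLite rounds the integer division to nearest, while a plain floor division biases the result downward.

```python
# Integer average pooling over one window: floor vs. rounded division.
window_sum, count = 7, 4

floor_avg = window_sum // count                   # 1  (previous topi behaviour)
rounded_avg = (window_sum + count // 2) // count  # 2  (TFLite-style round to nearest,
                                                  #     for non-negative sums)
```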
Thanks @LiangHao151941! Maybe we have to handle it a bit more. Please see the default behaviour of TFLite here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/optimized/integer_ops/pooling.h#L300 We should handle these ops too.
Thanks for the hard work. For FoldScaleAxis, it would be better if we could dig into what the issue is (I describe some background on this algorithm here: https://github.com/apache/incubator-tvm/pull/1664/files#diff-88bcab09a087491487d7655fa76ba116R527). I checked the CI's docker instance just now; our CI's LLVM version is LLVM 9 too, so your LLVM is OK IMO. However, I think you could run the tests inside the CI docker image to reproduce the issue locally.
Force-pushed from 3dfc1f9 to 19c79c7
Update: trying to pull the CI docker to further check. @FrozenGene
Force-pushed from 473ce0e to 85e1815
Force-pushed from c024176 to d6e4bb8
@LiangHao151941 Was wondering if you ever got a chance to look at the performance of TFLite rounding? I am looking at the performance of UPWARD rounding, and surprisingly Requantize was quite expensive in e2e models like mobilenet. I suspect TFLite rounding will be even slower, so I am wondering if you got a chance to check performance. I worry that so much effort in this PR might give us perfect functionality but not-so-good performance, and therefore, in the future, we might decide to trade off accuracy for performance, taking us back to square one.
IMO, we should get the functionality right first, then consider performance. Currently, our approach lowers to many ops, and I suspect this leads to worse performance because we cannot keep all computation in registers. I don't think TFLite rounding will be the performance bottleneck. Let us put the effort into functionality first; next, we should revisit our code and find the performance bottleneck with a profiling tool (like DS-5 Streamline).
Actually, Requantize is the bottleneck in my observation on a Raspberry Pi 4. For TFLite quantized mobilenetv1, currently with master + auto-tuning, I get 56 ms (TFLite is 35 ms). Upon deeper analysis, I found requantize to be the bottleneck (even with UPWARD rounding, which is the cheapest). The reason is that the calculations happen in int64, and either ARM does not have good instructions or LLVM does not pick the right ones. As an experiment, I forced the datatype of Requantize to be int32. This will lead to bad accuracy, but it gives us an idea of the minimum latency of the Requantize op. The runtime reduced from the earlier 56 ms to 36 ms (almost as good as TFLite). So, requantize is the bottleneck. Additionally, we should notice that TFLite rounding is not a standard rounding; they have worked backwards from the ARM instructions they want to use. Because of that, they have 2 roundings instead of 1: one in the saturating doubling high multiply and one in the rounding right shift.
This function alone needs around 10 Relay ops (if not more), but in TFLite this is just 1 ARM instruction: VQRDMULH. I doubt that LLVM will be able to use that instruction automatically. In summary, I understand the benefits of an exact match between TFLite and TVM. But in contrast to FP32 networks, this is one of the cases where frameworks have made their choice to trade off between performance and accuracy for a specific ARM device. Note that with non-TFLite rounding, the application accuracy is still similar, so it might be an acceptable tradeoff. Therefore, I would encourage thinking about this more deeply. I worry that the extensive work going into this PR might need major reconsideration when we don't see good performance improvements. If we relax the exact tensor match criteria, there might be other ways to get the performance, like using tensorize.
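For reference, a Python transcription of the two-step rounding being discussed (a saturating rounding-doubling high multiply followed by a rounding right shift), following the TFLite/gemmlowp reference helpers linked in the PR description; this sketches the reference arithmetic, not TVM's actual lowering:

```python
def saturating_rounding_doubling_high_mul(a, b):
    # First rounding: high 32 bits of 2*a*b, rounded to nearest, with the
    # single overflow case (-2^31 * -2^31) saturated.
    if a == -(2 ** 31) and b == -(2 ** 31):
        return 2 ** 31 - 1
    ab = a * b
    nudge = (1 << 30) if ab >= 0 else 1 - (1 << 30)
    ab_plus_nudge = ab + nudge
    # C++ integer division truncates toward zero; Python's // floors, so
    # emulate truncation explicitly.
    q = abs(ab_plus_nudge) // (1 << 31)
    return q if ab_plus_nudge >= 0 else -q

def rounding_divide_by_pot(x, exponent):
    # Second rounding: arithmetic right shift by a power of two, rounding to
    # nearest with ties away from zero (this maps onto ARM rounding shifts).
    mask = (1 << exponent) - 1
    remainder = x & mask
    threshold = (mask >> 1) + (1 if x < 0 else 0)
    return (x >> exponent) + (1 if remainder > threshold else 0)

def multiply_by_quantized_multiplier(x, quantized_multiplier, shift):
    # Mirrors the structure of the TFLite helper referenced in the PR
    # description: left shift before the high multiply, right shift after.
    # (Python ints are unbounded, so int32 wrap-around is not modelled.)
    left_shift = shift if shift > 0 else 0
    right_shift = -shift if shift < 0 else 0
    return rounding_divide_by_pot(
        saturating_rounding_doubling_high_mul(x * (1 << left_shift), quantized_multiplier),
        right_shift,
    )
```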
The vqrdmulh instruction is pretty standard on all Arm CPUs that have Advanced SIMD in 32-bit. C compilers will struggle with generating that instruction automagically: it's just too big an idiom to recognize, and I don't think LLVM has the infrastructure to detect it. Is there anything preventing us from having a specific TVM intrinsic that we can lower to this vqrdmulh instruction? Of course, if this fails we could instead lower to the C implementation? The equivalent AArch64 instruction is sqrdmulh IIRC and is available by default on all devices. There should be an equivalent LLVM intrinsic for this. ramana
Moving to discuss as we are talking a lot :)
Please note this thread: if we don't have bit-exact results, we will have many problems, not just a tradeoff. The MobileNet accuracy result alone is not enough. You could also refer to a blog on how to boost quantized performance. In it, we get bit-exact results and use the same rounding as TFLite, but we beat TFLite performance by a lot. However, the design differs a little bit: we designed one q_conv2d / q_requantize to control it. For q_conv2d, we tensorize it and do requantize inside it. For depthwise convolution, we don't even tensorize, and with q_conv2d + q_requantize we could still beat TFLite and QNNPACK by a lot. One problem with our current design is that we lower too many ops and cannot make sure all computations stay in registers, and NCHW has bad cache behaviour, like the previous blog said. We should make sure of functionality first, then consider performance. I think it is not only the ARM instruction you mention but also many other things we need to do.
Ok. I forgot the accuracy issue of object detection models. I will also go through the blog to see if we can do something in the short term. If possible, we should make a plan for all the things we need to do to get the performance speedup. I will share the results in a discussion post in a few days. Maybe you can share your experience there.
Yeah. Because of the different design, my original plan is that we achieve the same results as TFLite first, then start to consider how to achieve better performance leveraging our previous implementation.
Force-pushed from 9e72839 to 833c70d
Force-pushed from 1d04749 to ad3a380
Force-pushed from 83434fc to 960cc5c
Force-pushed from 8ef0656 to a3b5c21
Hi, I am wondering if we can take the pool operator changes from this PR and send them as a separate PR. I know that the earlier decision on this was to put everything in one PR for easy backporting. But, as this PR might take more time, and master TVM today does not implement any rounding for pooling ops for integer datatypes, I think it is valuable to separate that part out. What do you think? @FrozenGene @LiangHao151941
Unfortunately we need to close this PR due to its inactive status. Thanks @Fwd-IV for the original contribution. @Fwd-IV @FrozenGene @anijain2305, it would be great if you could coordinate to bring some of this PR up to mainline and send a new PR. We can use the https://docs.github.com/en/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors mechanism to acknowledge the original author.
Thanks for contributing to TVM! Please refer to guideline https://docs.tvm.ai/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @ them in the pull request thread.
Add TFLite rounding support with corresponding test cases. The TFLite rounding mode golden results are generated with a testbench using the MultiplyByQuantizedMultiplier function here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/common.h#L148
@FrozenGene @anijain2305
Might help fix the problem described here:
https://discuss.tvm.ai/t/supporting-bit-exact-tflite-qnn-inference/5528